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Description 

[0001] The present invention relates to a speech endpoint detecting method and, more particularly, to a method for 
detecting a speech period from a speech-bearing signal in a high-noise environment. 

[0002] Speech recognition technology is now in wide use. To recognize speech, it is necessary to detect a speech 
period to be recognized in the input signal. A description will be given of a conventional technique for detecting the 
speech period on the basis of amplitude that is the power of speech. The power herein mentioned is the square-sum 
of the input signal per unit time. Speech usually contains a pitch frequency component, whose power is particularly 
large in the vowel period. On the assumption that a frame in the input signal over which the power of the input signal 
exceeds a certain threshold value is a frame of vowel, the conventional scheme detects, as the speech period, the 
vowel frame together with several preceding and following frames. With this method, however, a problem arises that 
signals of large power, which last for about the same period of time as the duration of a word, are all erroneously 
detected as speech. That is, sounds of large power, such as the sounds of a telephone bell and a closing door, are 
detected as speech. Another problem of this method is that the more the power of background noise increases, the 
harder it is to detect the power period of speech. Hence, in voice control of an instrument in a car, for instance, there 
is a possibility of the instrument becoming uncontrollable or malfunctioning due to a recognition error. 
[0003] Another prior art method is to detect the speech period on the basis of a pitch frequency which is the funda- 
mental frequency of speech. This method utilizes that the pitch frequency of a vowel stationary part falls in the range 
of from 50 to 500 Hz or so. The pitch frequency of the input signal is examined, then the frame in which the pitch 
frequency stays in the above-mentioned frequency range is assumed to be the frame of vowel, and the frame and 
several preceding and following frames are detected as a speech period. With this method, however, a signal with the 
pitch frequency in the frequency range is erroneously detected as speech even if it is noise. In an environment where 
music containing a high pitch component, in general, is floating in the background, the speech period is very likely to 
be erroneously detected owing to the pitch component of the musical sound. Further, since the pitch frequency detecting 
method utilizes the fact that the waveform of human speech assumes high correlation every pitch, the superimposition 
of noise on speech make it impossible to obtain a high correlation value and hence detect the correct pitch frequency, 
resulting in failure to detect speech. 

[0004] In Japanese Patent Application Laid-Open No. 200300/85 there is proposed a method which is aimed at 
increasing the accuracy of detecting start and end points of the speech period. This method defines the start and end 
points of the speech period as the points in time when the signal spectrum undergoes large variations in the vicinities 
of the start and end points of a period in which the power of the input speech signal exceeds a threshold value. Since 
this method is predicated on the detection of the power level of the input signal that exceeds the threshold value, there 
is a very strong possibility of a detection error arising when the speech signal level is low or noise level is high. 
[0005] With the above-described conventional method for detecting the speech period based on the power of speech, 
when the power of background noise is large, it cannot be distinguished from the power of speech and the noise is 
erroneously detected as speech. On the other hand, according to the speech period detecting method based on the 
pitch frequency, when noise is superimposed on speech, there is a case where a stable pitch frequency cannot be 
obtained and hence no speech can be detected. Additionally, in US Patent No. 5,365,592 there is disclosed a method 
which obtains a cepstrum pitch by an FFT analysis of the input signal and, based on the cepstrum pitch, determines 
at every point in time whether the input signal is speech or not. This method is also prone to decision errors due to noise. 
[0006] Further, the document "Instantaneous Spectral Estimation of Non stationary Signals" by Takizawa et.al. , ICAS- 
SP-94, Vol. IV, pp. 329-332 discloses the use of frequency of spectrum variation of a signal for instantaneous spectrum 
estimation. 

[0007] It is therefore an object of the present invention to provide a signal processing method which permits stable 
detection of the speech period from the input signal even in a high-noise environment through utilization of information 
characteristic of speech. 

[0008] According to the present invention, the signal processing method for detecting the speech period in the input 
signal, comprises the steps of: 

(a) obtaining a spectral feature parameter by analyzing the spectrum of the input signal for each predetermined 
analysis window; 

(b) calculating the amount of change in the spectral feature parameter of the input signal per unit time; 

(c) calculating the frequency of variation in the amount of change of the spectral feature parameter over a prede- 
termined analysis frame period longer than the unit time; and 

(d) making a check to see if the frequency of variation falls in a predetermined frequency range and, if so, deciding 
that the input signal of the analysis frame is a speech signal. 

[0009] In the above signal processing method, the step of calculating the amount of change in the spectral feature 
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parameter comprises a step of obtaining a time sequence of feature vectors representing the spectra of the input signal 
at respective points in time, and a step of calculating the dynamic measures through the use of the feature vectors at 
a plurality of points in time and calculating the variation in the spectrum from the norm of the dynamic measures. 
[0010] In the above signal processing method, the frequency calculating step is a step of counting the number of 
peaks of the spectrum variation exceeding a predetermined threshold value and providing the resulting count value 
as the frequency. 

[0011] Alternatively, the frequency calculating step includes a step of calculating the sum total of variations in the 
spectrum of the input signal over the analysis frame period longer than the unit time and the deciding step decides 
that the input signal of the analysis frame period is a speech signal when the value of sum total is within a predetermined 
range of values. 

[0012] The above signal processing method further comprises a step of vector quantizing the input signal for each 
analysis window by referring to a vector code book composed of representative vectors of spectral feature parameters 
of speech prepared from speech data and calculating quantization distortion. When the quantization distortion is smaller 
than a predetermined value and the frequency of variation is within the predetermined frequency range, the deciding 
step (d) decides that the input signal in the analysis window represents the speech period. 

[0013] The above signal processing method further comprises a step of obtaining the pitch frequency, amplitude 
value or correlation value of the input signal for each analysis window and deciding whether the input signal is a vowel. 
When the vowel is detected and the frequency of variation is in the predetermined frequency range, the deciding step 
(d) decides that the input signal in the analysis window is a speech signal. Alternatively, the deciding step (d) counts 
the number of zero crossings of the input signal and, based on the count value, decides whether the input signal is a 
consonant, and decides the speech period on the basis of the decision result and the frequency of variation. 
[0014] According to the present invention, since attention is focused on the frequency of spectrum variation charac- 
teristic of a speech sound, even a noise of large power can be distinguished from speech if it does not undergo a 
spectrum change with the same frequency as does the speech. Accordingly, it is possible to determine if unknown 
input signals of large power, such as a steady-state noise and a gentle sound of music, are speech. Even if noise is 
superimposed on the speech signal, speech can be detected with high accuracy because the spectrum variation of 
the input signal can be detected accurately and stably. Further, a gentle singing voice and other signals relatively low 
in the frequency of spectrum variation can be eliminated or suppressed. 

[0015] The above method is based solely on the frequency of spectrum variation of the input signal, but the speech 
period can be detected with higher accuracy by combining the frequency of spectrum variation with one or more pieces 
of information about the spectral feature parameter, the pitch frequency, the amplitude value and the number of zero 
crossings of the input signal which represent its spectrum envelope at each point in time. 

Fig. 1 is a graph showing the frequency of spectrum change of a speech signal on which the present invention os 
based; 

Fig. 2 is a diagram for explaining an embodiment of the present invention; 
Fig. 3 is a timing chart of a spectrum analysis of a signal; 

Fig. 4 is a diagram showing and speech signal waveforms and the corresponding variations in the dynamic measure 
in the Fig. 2 embodiment; 

Fig. 5 is a diagram showing the results of speech detection in the Fig. 2 embodiment; 

Fig. 6 is a diagram for explaining another embodiment of the present invention which combines the frequency of 

spectrum change with a vector quantization scheme; 

Fig. 7 is a diagram showing the effectiveness of the Fig. 6 embodiment; 

Fig. 8 is a diagram illustrating another embodiment of the present invention which combines the frequency of 
spectrum change with the pitch frequency of the input signal; and 

Fig. 9 is a diagram illustrating still another embodiment of the present invention which combines the frequency of 
spectrum change with the number of zero crossings of the input signal. 

[0016] In accordance with the present invention, a spectrum variation of the input signal is derived from a time se- 
quence of its spectral feature parameters and the speech period to be detected is a period over which the spectrum 
of the input signal changes with about the same frequency as in the speech period. 

[0017] The detection of a change in the spectrum of the input signal begins with calculating the feature vector of the 
spectrum at each point in time, followed by calculating the dynamic feature of the spectrum from feature vectors at a 
plurality of points in time and then by calculating the amount of change in the spectrum from the norm of the dynamic 
feature vector. The frequency or temporal pattern of spectrum variation in the speech period is precalculated and a 
period during which the input signal undergoes a spectrum change similar to the above is detected as the speech 
period. As the spectral feature parameter, it is possible to use spectral envelope information obtainable by an FFT 
spectrum analysis, cepstrum analysis, short-time autocorrelation analysis, or similar spectrum analysis. The spectral 
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feature parameter is usually a sequence of plural values (corresponding to a sequence of spectrum frequencies), which 
will hereinafter be referred to as a feature vector. The dynamic feature may be the difference between time sequences 
of spectral feature parameters, a polynomial expansion coefficient or any other spectral feature parameters as long 
as they represent the spectrum variation. The frequency of spectrum variation is detected by a method capable of 
detecting the degree of spectrum change by counting the number of peaks of the spectrum variation over a certain 
frame time width or calculating the integral of the amount of change in the spectrum. 

[0018] Of sounds, a speech sound is, in particular, a sequence of phonemes and each phoneme has a characteristic 
spectrum envelope. Accordingly, the spectrum changes largely at the boundary between phonemes. Moreover, the 
number of phonemes which are produced per unit time (the frequency of generation of phonemes) in such a sequence 
of phonemes does not differ with languages but is common to general languages. In terms of the spectrum variation 
the speech signal can be characterized as a signal whose spectrum varies with a period nearly equal to the phoneme 
length. This property is not found in other sounds (noises) in the natural world. Hence, by precalculating an acceptable 
range of spectrum variation in the speech period, it is possible to detect, as the speech period, a period in which the 
frequency of occurrence of the spectrum variation of the input signal is in the precalculated range. 
[0019] As methods for analyzing the spectrum of the input signal, there have been known, for example, a method 
of directly frequency analyzing the input signal, a method of FFT (Fast Fourier Transform) analyzing the input signal 
and a method of LPC (Linear Predictive Coding) analyzing the input signal. The following is spectral parameter deriving 
equations by three representative speech spectrum analysis methods. 

20 (a) Spectral parameter <{>(m) by a short-time autocorrelation analysis: 

I N-l-lml 

4>(m) = TT I x(n)x(n + M) (1) 

25 iX n-0 

(b) Spectral parameter S(w) by a short-time spectrum analysis: 
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(c) Spectral parameter Cn by a cepstrum analysis: 



<o , N-l 

I 

k-0 



Cn — N I logjX(k)|exp{j27tkn/N} * (3) 



45 [0020] The spectral parameter by the LPC cepstrum analysis is expressed in the same form as Eq. (3). Furthermore, 
a linear prediction coefficient {otj|i=1, p}, a PARCOR coefficient {Kj|i=1, -,p} and a line spectrum pair LSP also 
represent spectral envelope information of speech signals. These spectral parameters are all expressed by a coefficient 
sequence (vector) and are called acoustic feature vectors. A description will be given typically of the LPC cepstrum 
C={Ci.C2,-",c K }, but any other spectral parameters can also be used. 

so [0021] As referred to previously herein, the principle of the present invention is to decide whether the period of the 
input signal is a speech period, depending on whether the frequency of spectrum variation of the input signal is within 
a predetermined range. The amount of change in the spectrum is obtained as a dynamic measure of speech as de- 
scribed below. The first step is to obtain a time sequence of acoustic parameter vectors of the speech signal by the 
FFT analysis, LPC analysis or some other spectrum analysis. Let it be assumed that a k-dimensional LPC cepstrum 

55 C(t)={c 1 ,c 2 , — , c K } is used as the feature vector at time t. Next, to represent a change in the frequency spectrum of 
speech over a window width 2n (n being the number of discrete points in time) of a certain period, a local movement 
of the cepstrum C(t) is linearly approximated by a weighted least squares method and its inclination A(t) (a linear 
differential coefficient) is obtained as the amount of change in the spectrum (a gradient vector). That is, setting the 
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weight w-w.j, the inclination by the linear approximation is given by the following equation: 



M l >- I iw.c k (t + i)/ I i 3 »v. 



n | n 



(4) 




In the above, a k (t) represents a k-th element of a k-dimensio nal vector A(t)={a 1 (t),a 2 (t),...,a k (t)} which represents the 
dynamic feature of the spectrum at time t, and A(t) is referred to as a delta cepstrum. That is, a^t) indicates a linear 
differential coefficient of a time sequence of k-dimensional cepstrum elements c^t) at time t (see Furui, "Digital Speech 
Signal Processing," Tokai University Press). 

[0022] The dynamic measure D(t) at time t is calculated by the following equation which represents the sum of 
squares of all elements of the delta cepstrum at time t (see Shigeki Sagayama and Fumitada Itakura, "On Individuality 
in a Dynamic Measure of Speech," Proc. Acoustical Society of Japan Spring Conf. 1 979, 3-2-7, pp.589-590, June 1 979). 



That is, the cepstrum C(k) represents the feature of the spectral envelope and the delta cepstrum, which is its linear 
differential coefficient, represents the dynamic feature. Hence, the dynamic measure represents the magnitude of the 
spectrum variation. The frequency SF of the spectrum variation is calculated as the number of peaks of the dynamic 
measures D(t) that exceed a predetermined threshold value D^ during a certain frame period F (an analysis frame), 
or as the sum total (integral) of the dynamic measures D(t) in the analysis frame F. 

[0023] While in the above the dynamic measure D(t) of the spectrum in the case of using the cepstrum C(t) has been 
described as the spectral feature (vector) parameter, the dynamic measure D(t) can be similarly defined as other 
spectral feature parameters which are represented by vector. 

[0024] Speech contains two to three phonemes in 400 msec, for instance, and the spectrum varies corresponding 
to the number of phonemes. Fig. 1 is a graph showing the number of peaks indicating large spectrum variations in the 
unit time (400 msec, which is defined as the analysis frame length F) measured for many frames. Eight pieces of 
speech data by reading were used. In Fig. 1 the abscissa represents the number of times the spectrum variation 
exceeded a value 0.5 per frame and the ordinate the rate at which the respective numbers of peaks were counted. As 
is evident from Fig. 1 , the number of peaks per frame is distributed from once to five times. Though differing with the 
threshold value used to determine peaks or the speech data used, this distribution is characteristic of speech sounds. 
Thus, when the spectrum of the input signal varies once to five times in the 400 msec period, it can be decided as a 
speech signal period. The variation in the spectrum (feature vector) represents the inclination of the time sequence C 
(t) of feature vectors at each point in time. 

[0025] Fig. 2 illustrates an embodiment of the present invention. A signal S input via a signal input terminal 11 is 
converted by an A/D converting part 12 to a digital signal. An acoustic feature extracting part 13 calculates the acoustic 
feature of the converted digital signal, such as its LPC or FFT cepstrums. A dynamic measure calculating part 14 
calculates the amount of change in the spectrum from the LPC cepstrum sequence. That is, the LPC cepstrum is 
obtained every 10 msec by performing the LPC analysis of the input signal for each analysis window of, for example, 
a 20 msec time width as shown on Row A in Fig. 3, by which a sequence of LPC cepstrums C(0), C(1), C(2), ... is 
obtained as shown on row B in Fig. 3. Each time the LPC cepstrum C(t) is obtained, the delta cepstrum A(t) is calculated 
by Eq. (4) from 2n+1 latest LPC cepstrums as shown on Row C in Fig. 3. Fig. 3 shows the case where n=1 . Next, each 
time the delta cepstrum A(t) is obtained, the dynamic measure D(t) is calculated by Eq. (5) as depicted on Row D in 
Fig. 3. 

[0026] By performing the above-described processing over the analysis frame F of a 400 msec time length considered 
to contain a plurality of phonemes, 40 dynamic measures D(t) are obtained. A speech period detecting part 15 counts 
the number of peaks of those of the dynamic measures D(t) which exceed the threshold value D^ and provides the 
count value as the frequency S F of the spectrum variation. Alternatively, the sum total of the dynamic measures D(t) 
over the analysis frame F is calculated and is defined as the frequency Sp of the spectrum variation. 
[0027] The frequency of spectrum variation in the speech period is precalculated, on the basis of which the upper 
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and lower limit threshold values are predetermined. The frame of the input signal which falls in the range from the 
upper and lower limit threshold values is detected as a speech frame. Finally, the speech period detected result is 
output from a detected speech period output part 16. By repeatedly obtaining the frequency S F of spectrum variation 
during the application of the input signal while shifting the temporal position of the analysis frame F by a time interval 
of 20 msec each time, the speech period in the input signal is detected. 

[0028] Fig. 4 is a diagram showing a speech signal waveform and an example of a pattern of the corresponding 
variation in the dynamic measure D(t). The speech waveform data shown on Row A is male speaker's utterances of 
Japanese words /keikai/ and /sasuga/ which means "alert" and "as might be expected," respectively. The LPC cepstrum 
analysis for obtaining the dynamic measure D(t) of the input signal was made using an analysis window 20 ms long 
shifting it by a 10 ms time interval. The delta cepstrum A(t) was calculated over a 100 ms frame width. It is seen from 
Fig. 4 that the dynamic measure D(t) does not much vary in a silent part of a stationary part of speech as shown on 
Row B and that peaks of dynamic measures appear at start and end points of the speech or at the boundary between 
phonemes. 

[0029] Fig. 5 is a diagram for explaining an example of the result of detection of speech with noise superimposed 
thereon. The input signal waveform shown on Row A was prepared as follows: The noise of a moving car was super- 
imposed, with a 0 dB SN ratio, on a signal obtained by concatenating two speakers' utterances of a Japanese word / 
aikawarazu/ which means "as usual," the utterances being separated by a 5 sec silent period. Row B in Fig. 5 shows 
a correct speech period representing the period over which speech is present. Row D shows variations in the dynamic 
measure D(t). Row C shows the speech period detected result automatically determined on the basis of variations in 
the dynamic measure D(t). The dynamic measure D(t) was obtained under the same conditions as in Fig. 4. Accordingly, 
the dynamic measure was obtained every 10 ms. The analysis frame length was 400 ms and the analysis frame was 
shifted in steps of 200 ms. The sum total of the dynamic measures D(t) in the analysis frame period was calculated as 
the frequency S F of the spectrum variation. In this example, the analysis frame F for which the value of this sum total 
exceeded a predetermined value 4.0 was detected as the speech period. While speech periods are not clearly seen 
on the input signal waveform because of low SN ratio, it can be seen that all speech periods were detected by the 
method of the present invention. Fig. 5 indicates that the present invention utilizes the frequency of the spectrum 
variation and hence permits detection of speech in noise. 

[0030] Fig. 6 is a diagram for explaining another embodiment of the present invention, which uses both of the dynamic 
measure and the spectral envelope information to detect the speech period. As is the case with the above-described 
embodiment, the signal input via the signal input terminal 11 is converted by the A/D converting part 13 to a digital 
signal. The acoustic feature extracting part 1 3 calculates, for the converted digital signal, the acoustic feature such as 
LPC or FFT cepstrum. The dynamic measure calculating part 14 calculates the dynamic measure D(t) on the basis of 
the acoustic feature. A vector quantizer 17 refers to a vector quantization code book memory 18, then sequentially 
reads out therefrom precalculated representative vectors of speech features and calculates vector quantization distor- 
tions between the representative vectors and feature vectors of the input signal to thereby detect the minimum quan- 
tization distortion. When the input signal in the analysis window is a speech signal, the acoustic feature vector obtained 
at that time can be vector quantized with a relatively small amount of distortion by referring to the code book of the 
vector quantization code book memory 18. However, if the input signal in the analysis window is not a speech signal, 
the vector quantization produces a large amount of distortion. Hence, by comparing the vector quantization distortion 
with a predetermined level of distortion, it is possible to decide whether the input signal in the analysis window is a 
speech signal or not. 

[0031] The speech period detecting part 15 decides that a signal over the 400 ms analysis frame period is a speech 
signal when the frequency S F of change in the dynamic measure falls in the range defined by the upper and lower limit 
threshold values and the quantization distortion between the feature vector of the input signal and the corresponding 
representative speech feature vector is smaller than a predetermined value. Although this embodiment uses the vector 
quantization distortion to examine the feature of the spectral envelope, it is also possible to use a time sequence of 
vector quantized codes to determine if it is a sequence characteristic of speech. Further, a method of obtaining a 
speech decision space in a spectral feature space may sometimes be employed. 

[0032] Now a description will be given of an example of an experiment which detects speech by a combination of 
the dynamic measure and the speech feature vector that minimizes the above-mentioned vector quantization distortion. 
This is an example of an experiment for detecting speech from an input signal composed of speech and the singing 
of a bird alternating with each other. In the experiment the vector quantization code book was prepared from a large 
quantity of speech data. As the speech data, 20 speakers' utterances of 50 words and 25 sentences were selected 
from an ATR speech database. The number of quantization points is 512. The feature vector is a 16-dimensional LPC 
cepstrum, the analysis window width is 30 ms and the window shift width 10 ms. The sum of quantization distortions 
of feature vectors provided every 10 ms was calculated using the 400 ms long analysis window shifted in steps of 200 
ms. Similarly, the sum of dynamic measures was also calculated using the 400 ms long analysis window shifted in 
steps of 200 ms. For each of the dynamic measure and the quantization distortion, the range of their acceptable values 
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in the speech period is preset based on learning speech and the speech period is detected when input speech falls in 
the range. 

[0033] The input signal used for evaluation was alternate concatenations of eight sentences each composed of 
speech about 5 sec long and eight kinds of birds' songs each 5 sec long, selected from a continuous speech database 
of the Acoustical Society of Japan. The following measures are set to evaluate the performance of this embodiment. 

Frame detect rate=(the number of correctly detected 
speech frames)/(the number of speech 
frames in evaluation data) 



Correct rate=(the number of correctly detected speech 
frames )/(the number of frames that the 
system output as speech) 

[0034] The correct rate represents the extent to which the result indicated by the system as the speech frame is 
correct. The detect rate represents the extent to which the system could detect speech frames in the input signal. In 
Fig. 7 there are shown, using the above measures, the results of speech detection with respect to the evaluation data. 
The spectrum variation speed of the singing of birds bears a close resemblance to the spectrum variation speed of 
speech; hence, when only the dynamic measure is used, the singing of birds is so often erroneously detected as speech 
that the correct rate is low. With the combined use of the dynamic measure and the vector quantization distortion, the 
spectral envelope of the singing of birds can be distinguished from the spectral envelope of speech and the correct 
rate increases accordingly. 

[0035] Incidentally, in the case of a long vowel such as a diphthong, the spectrum may sometimes undergo no var- 
iation in the vowel period. When speech contains such a vowel, there is a possibility of a detection error arising only 
with the method of the present invention which uses the spectrum variation. By combining this invention method with 
the detection of the pitch frequency, amplitude value or autocorrelation coefficient of the input signal heretofore utilized, 
it is possible to reduce the possibility that the detection error arises. The pitch frequency is the number of vibrations of 
the human vocal cords and ranges from 50 to 500 Hz and distinctly appears in the stationary part of the vowel. That 
is, the pitch frequency component usually has large amplitude (power) and the presence of the pitch frequency com- 
ponent means that the autocorrelation coefficient value in that period is large. Then, by detecting the start and end 
points and periodicity of the speech period through the detection of the frequency of the spectrum variation by this 
invention method and by detecting the vowel part with one or more of the pitch frequency, the amplitude and autocor- 
relation coefficient, it is possible to reduce the possibility of detection errors arising in the case of speech containing a 
long vowel. 

[0036] Fig. 8 illustrates another embodiment of the present invention which combines the Fig. 2 embodiment with 
the vowel detection scheme. No description will be given of steps 12 to 16 in Fig. 8 since they correspond to those in 
Fig. 2. A vowel detecting part 21 detects the pitch frequency, for instance. The vowel detecting part 21 detects the pitch 
frequency in the input signal and provides it to the speech period detecting part 15. The speech period detecting part 
1 5 determines if the frequency S F of the variation in the dynamic measure D(t) is in the predetermined threshold value 
range in the same manner as in the above and decides whether the pitch frequency is in the 50 to 500 Hz range typically 
of human speech. An input signal frame which satisfies these two conditions is detected as a speech frame. In Fig. 8 
the vowel detecting part 21 is shown to be provided separately of the main processing steps 12 through 16, but since 
in practice the pitch frequency, spectral power or autocorrelation value can be obtained by calculation in step 1 3 in the 
course of cepstrum calculation, the vowel detecting part 21 need not always be provided separately. While in Fig. 8 
the detection of the pitch frequency is shown to be used for the detection of the speech vowel period, it is also possible 
to calculate one or more of the pitch frequency, power and autocorrelation value and use them for the decision of the 
speech signal. 

[0037] For the detection of the speech period, the detection of vowel shown in Fig. 8 may be substituted with the 
detection of a consonant. Fig. 9 shows a combination of the detection of the number of zero crossings and the detection 
of the frequency of spectrum variation. Unvoiced fricative sounds mostly have a distribution of 400 to 1400 zero cross- 
ings per second. Accordingly, it is also possible to employ a method which detects the start point of a consonant, using 
a proper aero crossing number threshold value selected by a zero crossing number detecting part 22 as shown in Fig. 9. 
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[0038] The speech period detecting method according to the present invention described above can be applied to 
a voice switch which turns ON and OFF an apparatus under voice control or the detection of speech periods for speech 
recognition. Further, this invention method is also applicable to speech retrieval which retrieves a speech part from 
video information or CD acoustic information data. 

5 [0039] As described above, according to the present invention, since the speech period is detected on the basis of 
the frequency of spectrum variation characteristic of human speech, only the speech period can stably be detected 
even from speech with noise of large power superimposed thereon. And noise of a power pattern similar to that of 
speech can also be distinguished as non-speech when the speed of its spectrum variation differs from the phoneme 
switching speed of speech. Therefore, the present invention can be applied to the detection of the speech period to 

10 be recognized in preprocessing when a speech recognition unit is used in a high-noise environment, or to the technique 
for retrieving a scene of conversations, for instance, from acoustic data of a TV program, movie or similar media which 
contains music or various sounds and for video editing or summarizing its contents. Moreover, the present invention 
permits detection of the speech period with higher accuracy by combining the frequency of spectrum variation with the 
power value, zero crossing number, autocorrelation coefficient or fundamental frequency which is another characteristic 

15 of speech. 

[0040] It will be apparent that many modifications and variations may be effected without departing from the scope 
of the novel concepts of the present invention as defined in the appended claims. 

20 Claims 

1. A signal processing method for detecting a speech period in an input signal, comprising the steps of: 

(a) obtaining a spectral feature parameter by analyzing the spectrum of said input signal for each predeter- 
25 mined analysis window; 

(b) calculating the amount of change in said spectral feature parameter of said input signal per unit time; 

(c) calculating the frequency of variation in the amount of change of said spectral feature parameter over a 
predetermined analysis frame period longer than said unit time; and 

(d) making a check to see if said frequency of variation falls in a predetermined frequency range and, if so, 
30 deciding that said input signal of said analysis frame is a speech signal. 

2. The method of claim 1 , wherein said step of calculating the amount of change in said spectral feature parameter 
comprises a step of obtaining a time sequence of feature vectors representing the spectra of said input signal at 
respective points in time, and a step of calculating dynamic features through the use of said feature vectors at a 

35 plurality of points in time and calculating the variation in the spectrum of said input signal from the norm of said 

dynamic features. 

3. The method of claim 2, wherein said dynamic feature are polynomial expansion coefficients of said feature vectors 
at a plurality of points in time. 

40 

4. The method of claim 1 , 2, or 3, wherein said frequency calculating step is a step of counting the number of peaks 
of said spectrum variation exceeding a predetermined threshold value over said analysis frame and providing the 
count value as said frequency. 

45 5. The method of claim 1 , 2, or 3, wherein said frequency calculating step includes a step of calculating the sum total 
of variations in the spectrum of said input signal over said predetermined analysis frame period longer than said 
unit time and said deciding step decides that said input signal of said analysis frame period is a speech signal 
when said sum total falls in a predetermined range of values. 

50 6. The method of claim 4 or 5 as far as not referring to claim 3, wherein said step of calculating said spectrum variation 
comprises a step of calculating a gradient vector using as its elements linear differential coefficients of respective 
elements of a vector representing said spectral feature parameter, and a step of calculating square-sums of said 
respective elements of said gradient vector as dynamic measures of said spectrum variation. 

55 7. The method of claim 6, wherein said spectral feature parameter is an LPC cepstrum and said spectrum variation 
is a delta cepstrum. 

8. The method of claim 1, further comprising a step of vector quantizing said input signal for each said analysis 
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window by referring to a vector code book composed of representative vectors of spectral feature parameters of 
speech prepared from speech data and calculating quantization distortion, and wherein said deciding step decides 
that said input signal is a speech signal when said quantization distortion is smaller than a predetermined value 
and said frequency of variation is within said predetermined frequency range. 

9. The method of claim 1 , further comprising a step of detecting whether said input signal in said each analysis window 
is a vowel, and wherein said deciding step (d) decides whether said input signal is a speech signal by detecting a 
vowel and by detecting whether said frequency of variation is in said predetermined frequency range. 

10. The method of claim 9. wherein said vowel detecting step detects a pitch frequency in said input signal for said 
each analysis window and decides that said input signal is a vowel when said detected pitch frequency is in a 
predetermined frequency range. 

11. The method of claim 9, wherein said vowel detecting step detects the power of said input signal for said each 
analysis window and decides that said input signal is a vowel when said detected power is larger than a predeter- 
mined value. 

12. The method of claim 9, wherein said vowel detecting step detects the autocorrelation value of said input signal 
and decides that said input signal is a vowel when said detected autocorrelation value is larger than a predetermined 
value. 

13. The method of claim 1 , further comprising a step (e) of counting the number of zero crossings of said input signal 
in said each analysis window and deciding that said input signal in said analysis window is a consonant when said 
count value is within a predetermined range, and wherein said deciding step (d) decides whether said input signal 
is a speech signal by deciding whether said input signal is a consonant by said deciding step (e) and by deciding 
whether said frequency of variation is in said predetermined frequency range. 

14. The method of claim 1 , 2, or 3, wherein said spectral feature parameter is an LPC cepstrum. 

15. The method of claim 1 , 2, or 3, wherein said spectral feature parameter is an FFT cepstrum. 



Patentanspruche 

1. Signalverarbeitungsverfahren zum Erfassen einer Sprachperiode in einem Eingangssignal, mit den Schritten: 

(a) Erhalten eines spektralen Merkmalparameters durch Analysieren des Spektrums des Eingangssignals fur 
jedes vorgegebene Analysefenster; 

(b) Berechnen des Ausmaftes der Anderung des spektralen Merkmalparameters des Eingangssignals pro 
Zeiteinheit; 

(c) Berechnen der Anderungsfrequenz des AusmafJes der Anderung des spektralen Merkmalparameters uber 
eine vorgegebene Analyserahmenperiode, die langer als die Zeiteinheit ist; und 

(d) Uberprufen, ob die Frequenz der Anderung in einen vorgegebenen Frequenzbereich fallt, und, wenn ja, 
Entscheiden, daft das Eingangssignal des Analyserahmens ein Sprachsignal ist. 

2. Verfahren nach Anspruch 1, bei dem der Schritt des Berechnens des Ausmafces der Anderung des spektralen 
Merkmalparameters einen Schritt des Erhaltens einer Zeitfolge von Merkmalvektoren, die die Spektren des Ein- 
gangssignals an jeweiligen Zeitpunkten darstellen, und einen Schritt des Berechnens von dynamischen Merkmalen 
durch Verwendung der Merkmalvektoren an einer Mehrzahl von Zeitpunkten und des Berechnens der Anderung 
im Spektrum des Eingangssignals aus der Norm der dynamischen Merkmale umfaGt. 

3. Verfahren nach Anspruch 2, bei dem das dynamische Merkmal Polynom-Expansionskoeffizienten der Merkmal- 
vektoren an einer Mehrzahl von Zeitpunkten sind. 

4. Verfahren nach Anspruch 1,2 oder 3, bei dem der Schritt des Berechnens der Frequenz ein Schritt des Zahlens 
der Anzahl der einen vorgegebenen Schwellwert uberschreitenden Peaks der Spektrumsanderung in dem Ana- 
lyserahmen und des Lieferns des Zahlwertes als die Frequenz ist. 
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5. Verfahren nach Anspruch 1 , 2 oder 3, bei dem der Schritt des Berechnens der Frequenz einen Schritt des Be- 
rechnens der Gesamtsumme der Anderungen im Spektrum des Eingangssignals in der vorgegebenen Analyse- 
rahmenperiode, die langer ats die Zeiteinheit ist, umfafJt, und der Schritt des Entscheidens entscheidet, dafi das 
Eingangssignal der Analyserahmenperiode ein Sprachsignal ist, wenn die Gesamtsumme in einen vorgegebenen 

5 Wertebereich fallt. 

6. Verfahren nach Anspruch 4 oder 5, soweit nicht auf Anspruch 3 bezogen, bei dem der Schritt des Berechnens der 
Spektrumsanderung einen Schritt des Berechnens eines Gradientenvektors, als dessen Elemente lineare Diffe- 
rentialkoeffizienten von jeweiligen Elementen eines den spektralen Merkmalparameter darstellenden Vektors ver- 

10 wendet werden, und einen Schritt des Berechnens von Quadratsummen der jeweiligen Elemente des Gradien- 

tenvektors als dynamische Messwerte der Spektrumsanderung umfafit. 

7. Verfahren nach Anspruch 6, bei dem der spektrale Merkmalparameter ein LPC-Cepstrum ist und die Spektrum- 
anderung ein Delta-Cepstrum ist. 

8. Verfahren nach Anspruch 1 , ferner mit einem Schritt des vektoriellen Quantisierens des Eingangssignals fur jedes 
derAnalysefensterdurch Bezugnahmeauf ein Vektorcodebuch, das aufgebautistaus aus Sprachdaten erhaltenen 
reprasentativen Vektoren von spektralen Merkmalparametern von Sprache, und des Berechnens von Quantisie- 
rungsverzerrung, wobei in dem Schritt des Entscheidens entschieden wird, dafc das Eingangssignal ein Sprach- 
signal ist, wenn die Quantisierungsverzerrung kleiner als ein vorgegebener Wert ist und die Frequenz der Anderung 
innerhalb des vorgegebenen Frequenzbereiches liegt. 

9. Verfahren nach Anspruch 1 , ferner mit einem Schritt des Erfassens, ob das Eingangssignal in einem jeweiligen 
Analysefenster ein Vokal ist, und wobei in dem Entscheidungsschritt (d) entschieden wird, ob das Eingangssignal 

25 ein Sprachsignal ist, indem ein Vokal erfafit wird und erfafit wird, ob die Frequenz der Anderung in dem vorgege- 

benen Frequenzbereich liegt. 

10. Verfahren nach Anspruch 9, bei dem indem Vokalerfassungsschritt eineTonho^enfrequenz in dem Eingangssignal 
fur jedes Analysefenster erfaftt wird und entschieden wird, dafi das Eingangssignal ein Vokal ist, wenn die erfafite 

30 Tonhohenfrequenz in einem vorgegebenen Frequenzbereich liegt. 

11. Verfahren nach Anspruch 9, bei dem in dem Vokalerfassungsschritt die Leistung des Eingangssignals fur jedes 
Analysefenster erfaftt wird und entschieden wird, dafi das Eingangssignal ein Vokal ist, wenn die erfafite Leistung 
grofier als ein vorgegebener Wert ist. 

35 

12. Verfahren nach Anspruch 9, bei dem in dem Vokalerfassungsschritt der Autokorrelationswert des Eingangssignals 
erfalit wird und entschieden wird, daft das Eingangssignal ein Vokal ist, wenn der erfafite Autokorrelationswert 
grofier als ein vorgegebener Wert ist. 

13. Verfahren nach Anspruch 1, ferner mit einem Schritt (e) des Zahlens der Anzahl von Nulldurchgangen des Ein- 
gangssignals in jedem Analysefenster und des Entscheidens, dafi das Eingangssignal in dem Analysefenster ein 
Konsonant ist, wenn der Zahlwert innerhalb eines vorgegebenen Bereiches liegt, und wobei in dem Entschei- 
dungsschritt (d) entschieden wird, ob das Eingangssignal Sprache ist, indem durch den Entscheidungsschritt (e) 
entschieden wird, ob das Eingangssignal ein Konsonant ist und entschieden wird, ob die Anderungsfrequenz in 
dem vorgegebenen Frequenzbereich liegt. 

14. Verfahren nach Anspruch 1, 2 oder 3, bei dem der spektrale Merkmalparameter ein LPC-Cepstrum ist. 

15. Verfahren nach Anspruch 1 , 2 oder 3, bei dem der spektrale Merkmalparameter ein FFT-Cepstrum ist. 



Revendications 

1. Procede de traitement de signal destine a detecter une periode de parole dans un signal d'entree, comprenant 
55 les etapes consistant : 

(a) a obtenir un parametre de particularity spectrale en analysant le spectre dudit signal d'entree pour chaque 
fenetre d'analyse predeterminee ; 
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(b) a calculer ia valeur de changement dudit parametre de particularite spectrale dudit signal d'entree par unite 
de temps ; 

(c) a calculer la frequence de variation de la valeur de changement dudit parametre de particularite spectrale 
sur une periode predeterminee de trame analysee plus longue que ladite unite de temps ; et 

(d) a faire un controle pour voir si ladite frequence de variation tombe dans une plage de frequences prede- 
terminee et, s'il en est ainsi, a decider que ledit signal d'entree de ladite trame analysee est un signal de parole. 

2. Precede selon la revendication 1, dans lequel ladite etape consistant a calculer la valeur de changement dudit 
parametre de particularite spectrale comprend une etape consistant a obtenir une sequence temporelle de vecteurs 
de particularite representant les spectres dudit signal d'entree a des points respectifs dans le temps, et une etape 
consistant a calculer des parti cula rites dynamiques par I'utilisation desdits vecteurs de particularite a une pluralite 
de points dans le temps et a calculer la variation du spectre dudit signal d'entree par rapport a la norme desdites 
particularites dynamiques. 

3. Procede selon la revendication 2, dans lequel ladite particularite dynamique est constitute par des coefficients 
d'extension de polynome desdits vecteurs de particularite en une pluralite de points dans le temps. 

4. Procede selon la revendication 1 , 2, ou 3, dans lequel ladite etape consistant a calculer la frequence est une etape 
consistant a compter le nombre de pics de ladite variation de spectre qui depassent une valeur de seuil predeter- 
minee sur ladite trame analysee et a delivrer la valeur comptee en tant que ladite frequence. 

5. Procede selon la revendication 1 , 2 ou 3, dans lequel ladite etape consistant a calculer la frequence cornprend 
une etape consistant a calculer la somme totale des variations du spectre dudit signal d'entree sur ladite periode 
de trame analysee predeterminee plus longue que ledit temps unitaire et dans lequel ladite etape de decision 
decide que ledit signal d'entree de ladite periode de trame analysee est un signal de parole lorsque ladite somme 
totale tombe dans une plage de valeurs predeterminee. 

6. Procede selon la revendication 4 ou 5, dans la mesure ou elie ne se refere pas a la revendication 3, dans lequel 
ladite etape consistant a calculer ladite variation de spectre comprend une etape consistant a calculer un vecteur 
de gradient utilisant, comme ses elements, des coefficients differentiels lineaires d'elements respectifs d'un vecteur 
representant ledit parametre de particularite spectrale, et une etape consistant a calculer des sommes de carres 
desdits elements respectifs dudit vecteur de gradient en tant que mesures dynamiques de ladite variation de 
spectre. 

7. Procede selon ia revendication 6, dans lequel ledit parametre de particularite spectrale est un cepstre de LPC 
(codage predictif lineaire) et dans lequel ladite variation de spectre est un cepstre de delta. 

8. Procede selon la revendication 1 , comprenant en outre une etape consistant a effectuer une quantification vecto- 
rielle dudit signal d'entree pour chaque dite fenetre d'analyse en se referant a un dictionnaire de codes de vecteurs 
compose de vecteurs representatifs de parametres de particularite spectrale de la parole, prepares a partir de 
donnees de parole, et a calculer la distorsion de quantification, et dans lequel ladite etape de decision decide que 
ledit signal d'entree est un signal de parole lorsque ladite distorsion de quantification est plus petite qu'une valeur 
predeterminee et que ladite frequence de variation est a I'interieur de ladite plage de frequences predeterminee. 

9. Procede selon la revendication 1 , comprenant en outre une etape consistant a detecter le fait que ledit signal 
d'entree dans chaque dite fenetre d'analyse est, ou non, une voyelle, et dans lequel ladite etape (d) de decision 
decide si ledit signal d'entree est un signal de parole par la detection d'une voyelle et par ia detection du fait que 
ladite frequence de variation est dans ladite plage de frequences predeterminee. 

10. Procede selon la revendication 9, dans lequel ladite etape consistant a detecter une voyelle detecte une frequence 
de hauteur de son dans ledit signal d'entree pour chaque dite fenetre d'analyse et decide que ledit signal d'entree 
est une voyelle lorsque ladite frequence de hauteur de son detectee est dans une plage de frequences predeter- 
minee. 

11. Procede selon ia revendication 9, dans lequel ladite etape consistant a detecter une voyelle detecte la puissance 
dudit signal d'entree pour chaque dite fenetre d'analyse et decide que ledit signal d'entree est une voyelle lorsque 
ladite puissance detectee est plus grande qu'une valeur predeterminee. 
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12. Procede selon la revendication 9, dans lequel ladite etape consistant a detecter une voyelle detecte la valeur 
d'autocorrelation dudit signal d'entree et decide que ledit signal d'entree est une voyelle lorsque ladite valeur 
d'autocorrelation detectee est plus grande qu'une valeur predeterminee. 

5 13. Procede selon la revendication 1 , comprenant en outre une etape (e) consistant a compter le nombre de passages 
par zero dudit signal d'entree dans chaque dite fenetre d'analyse et a decider que ledit signal d'entree dans ladite 
fenetre d'analyse est une consonne lorsque ladite valeur comptee est a I'interieur d'une plage predeterminee, et 
dans lequel ladite etape (d) de decision decide si ledit signal d'entree est un signal de parole en decidant si ledit 
signal d'entree est une consonne a I'aide de ladite etape (e) de decision et en decidant si ladite frequence de 

10 variation est dans ladite plage de frequences predeterminee. 

14. Procede selon la revendication 1, 2 ou 3, dans lequel ledit parametre de particuiarite spectrale est un cepstre de 
LPC. 

15 15. Procede selon la revendication 1, 2 ou 3, dans lequel ledit parametre de particuiarite spectrale est un cepstre de 
TFR (transformee de Fourier rapide). 
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